Parsing Movies in Context

Authors

  • Thomas G. Aguierre Smith
  • Natalio Pincever

Abstract

Traditional approaches in multimedia systems force the user to segment video material into simple descriptions, which usually include endpoints and a brief text description in the form of keywords. We propose to segment contextual information into chunks rather than segmenting contiguous frames. The computer can help us organize sets of descriptions that are related to the recorded moving image: it can help us remember what we have shot. Such descriptions can overlap, be contained in, and even encompass a multitude of other descriptions. Parsing moving image sequences is reduced to simply parsing the contextual information which forms the description. Our approach, Stratification, also has important ramifications in terms of an elastic representation of moving images. Ambient sound can provide us with important contextual clues as to what is going on within the frames. Using audio to find patterns of content is an important step towards the eventual automation of the logging process.

1. Approaches for Production: Segmentation versus Stratification

Segmentation forces the user to break down the raw footage into segments denoted by begin and end points and to assign some sort of textual description to them. In a video database application, these textual descriptions are searched and the associated video is retrieved. In our research we have found some problems with this approach.

On a multimedia system in which it is possible to access individual frames randomly, each frame needs to be described independently. The user has to represent the content of each chunk: the part is forsaken for the whole.

In terms of granularity, the chunk of video that a database application can retrieve for a user is predetermined during the logging process. The computer representation of the video footage only encompasses a begin point, an end point, and a text description. Subsamples, or finer-grained search criteria and representations, have to be created independently.
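
The segmentation model described above can be sketched in a few lines. This is only an illustration under assumptions: the `Segment` class, the `search` function, the frame numbers, and the keywords are all hypothetical and do not come from any system the authors describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One logged segment: begin frame, end frame (inclusive), keyword text."""
    begin: int
    end: int
    keywords: str


def search(log, term):
    """Return the frame range of every segment whose keywords mention term."""
    return [(s.begin, s.end) for s in log if term.lower() in s.keywords.lower()]


# Hypothetical log: frame numbers and keywords are invented for illustration.
log = [
    Segment(0, 29, "kitchen, character enters, sits down"),
    Segment(30, 89, "street, car passes"),
]

# The query can only return whole segments. Even if the character sits down
# during the last few frames, all 30 frames of the first segment come back.
hits = search(log, "sits down")
```

The unit of retrieval is fixed when the log is written: nothing in this representation can name a subset of frames 0 to 29 unless someone logs that subset as a new segment with its own description.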

In terms of elasticity, if a segment contains 30 frames, the application will retrieve all 30 frames when queried. It has no representation of any set of 10 or 20 frames which form a subset of the base frames. One would get 30 frames, select a subsample of them, and describe that subsample independently so that it can be called up later to satisfy more finely grained search criteria.

Furthermore, while segmentation is necessary when editing a video, it imposes the specific intentionality of the person who is placing the material in a structured context. Although this is desirable for a linear editing system, it can seriously impede a group of people who need to use and have access to the same video resources from an archive over a network. If the video material is segmented from the start, how then can the descriptive structures support other users who may need to access the same material for a different purpose?

Our solution is to segment contextual information into chunks rather than segmenting contiguous frames. This new context-based approach is called Stratification (Aguierre Smith 1991). Begin and end frames are used to segment contextual descriptions for a contiguous set of frames. These descriptions are called strata: a shot begins and ends; a character enters and sits down; the camera zooms in. Each represents a stratum. Any frame can have a variable number of strata associated with it or with a part of it (pixel). The content for any set of frames can be derived by examining the union of all the contextual descriptions that are associated with it.

Before we can begin to think about parsing, we need a good representation of moving image content. Stratification is a way to structure video content information that will allow us the greatest latitude in terms of parsing. Stratification is a descriptive methodology which generates rich multi-layered descriptions that can be parsed by other applications.

1.1. The Management of Multimedia Resources

The movie maker knows a lot about the images as she is recording them. Knowledge about the content of the moving image is at its maximum while being recorded and drops to a minimum while being viewed. The computer can help us organize sets of descriptions that are related to the recorded moving image: it can help us remember what we have shot.

Successive shots share a set of descriptive attributes that result from their proximity. These attributes are the linear context of the moving image. During recording, the linear context of the frames and the environmental context of the camera coincide. The environmental context is the "where," "who," "what," "when," "why," and "how" which relate to the scene: it is the physical space in which the recording takes place, giving it a unique identity. If you know enough about the environment in which you are shooting, you can derive a good description of the images that you have captured using stratification.

Clearly such descriptions can overlap, be contained in, and even encompass a multitude of other descriptions. Each descriptive attribute is an important contextual element: the union of several of these attributes produces the meaning or content for that piece of film. Moreover, each additional descriptive layer is automatically situated within the descriptive strata in which it already exists. In this way, rich descriptions of content can be built.

On the other hand, segmentation, which is conventionally used in computer systems which allow the user to log video material, forces the user to break down raw footage into segments denoted by begin and end points: such a divide-and-conquer method forsakes the whole for the part. Coarser descriptions have to be included at this level of specification in order to describe one frame independently.
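
By contrast, the stratified description of a frame does not need to be written independently: it can be derived from every stratum that spans it. The following is a minimal sketch, assuming strata are stored as frame ranges with a text description. The `Stratum` class, both helper functions, and the sample data are hypothetical, and pixel-level strata are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """A contextual description that holds from begin to end (inclusive)."""
    begin: int
    end: int
    description: str


def content_at(strata, frame):
    """Descriptions of every stratum that covers a single frame."""
    return {s.description for s in strata if s.begin <= frame <= s.end}


def content_of(strata, first, last):
    """Union of the descriptions of all strata overlapping frames first..last."""
    return {s.description for s in strata if s.begin <= last and s.end >= first}


# Hypothetical strata: they overlap, contain, and encompass one another.
strata = [
    Stratum(0, 299, "location: kitchen"),
    Stratum(0, 119, "shot 1"),
    Stratum(40, 100, "character enters and sits down"),
    Stratum(90, 150, "camera zooms in"),
    Stratum(120, 299, "shot 2"),
]

frame_95 = content_at(strata, 95)
frames_101_to_119 = content_of(strata, 101, 119)
```

Here frame 95 inherits four descriptions without anyone having logged that frame, and any frame range can be queried, not only ranges fixed at logging time. Adding a new stratum places it within the existing ones automatically, because membership is computed from the frame numbers.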

If the unit represented by in and out points is as small as an individual frame, its description, in order to be independent, must encompass larger descriptive units. The granularity of description of sets of frames is directly related to the size of the image units being represented. In segmentation, the granularity of description is inversely proportional to the size of the image unit in question. The inverse relationship arises out of the need to describe each image unit independently.

In addition to logging, film makers need tools which will enable them to take segments of raw footage and arrange them to create meaningful sequences. Editing is the process of selecting chunks of footage and sound and rearranging them into a temporal linear sequence (Davenport, Aguierre Smith, Pincever 1990). The edited linear sequence may bear no resemblance to the ambient reality that was in effect during recording. During the process of conventional editing, the original rushes become separated from the final motion picture. In order to create a motion picture with a new meaning which is distinct from the raw footage, a new physical object must be created: an edited print for film, or an edited master for video.

Editing on a digital-movie-database system will be radically different. In a fully digital system, the link between source material and final movie does not have to be broken. The shot in the final version of the movie being made will be the same chunk of video data as the source material. At this point there are two names for the image unit which must coexist: one reflects the context of the source material; the other is an annotation that is related to a playback time for a personalized movie script.
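
One way to picture the two coexisting names is a single chunk of video data that is referenced both by a source-context description and by an entry in a movie script. This is a sketch under assumptions, not the authors' design: `Chunk`, `ScriptEntry`, the identifier `tape_01`, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A contiguous range of source frames (hypothetical identifier scheme)."""
    source_id: str
    begin: int
    end: int


@dataclass(frozen=True)
class ScriptEntry:
    """Second name: where the chunk plays in a personalized movie script."""
    playback_time: float  # seconds from the start of the edited movie
    chunk: Chunk
    annotation: str


chunk = Chunk("tape_01", 40, 100)

# First name: the context of the source material.
source_context = {chunk: "character enters and sits down"}

# The script refers to the same chunk rather than to a copy of the footage.
script = [ScriptEntry(0.0, chunk, "opening scene")]

# From the edited movie we can still reach the original source description.
recovered = source_context[script[0].chunk]
```

Because the script entry points at the source chunk instead of a new physical object, the link between the final movie and the original footage is never broken.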





Publication date: 1991